Fault-tolerant scheduling on parallel systems with non-memoryless failure distributions

Authors

  • Mohamed-Slim Bouguerra
  • Derrick Kondo
  • Fernando Machado Mendonca
  • Denis Trystram
Abstract

As large parallel systems increase in size and complexity, failures are inevitable and exhibit complex space and time dynamics. Most often, in real systems, failure rates are increasing or decreasing over time. Considering non-memoryless failure distributions, we study a bi-objective scheduling problem of optimizing application makespan and reliability. In particular, we determine whether one can optimize both makespan and reliability simultaneously, or whether one metric must be degraded in order to improve the other. We also devise scheduling algorithms for achieving (approximately) optimal makespan or reliability. When failure rates decrease, we prove that makespan and reliability are opposing metrics. In contrast, when failure rates increase, we prove that one can optimize both makespan and reliability simultaneously. Moreover, we show that the largest processing time (LPT) list scheduling algorithm achieves good performance when processors are of uniform speed. The implications of our findings are the accelerated completion and improved reliability of parallel jobs executed across large distributed systems. Finally, we conduct simulations to investigate the impact of failures on the performance, which is done using an actual application of biological sequence comparison. © 2014 Elsevier Inc. All rights reserved.
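To make the two technical ingredients of the abstract concrete, here is a minimal sketch. It is not the authors' implementation. The function `lpt_schedule` is a hypothetical name for a textbook LPT list scheduler: tasks are sorted by decreasing size and each is placed on the processor where it would finish earliest. The `speeds` parameter is an assumption that allows processors to differ in speed; with equal speeds it reduces to classic LPT. The function `weibull_hazard` is likewise an illustration only: the Weibull distribution is a common example of a non-memoryless failure law, and the abstract does not say which distributions the paper uses.

```python
def lpt_schedule(task_sizes, speeds):
    """Largest-processing-time-first list scheduling (illustrative sketch).

    task_sizes: amount of work per task.
    speeds: work units per time unit for each processor (assumed parameter).
    Returns (assignment, makespan), where assignment[p] lists the task
    indices placed on processor p.
    """
    finish = [0.0] * len(speeds)          # current finish time of each processor
    assignment = [[] for _ in speeds]

    # Visit tasks from largest to smallest; ties keep their original order.
    for task_id, size in sorted(enumerate(task_sizes), key=lambda t: -t[1]):
        # Pick the processor on which this task would complete earliest.
        best = min(range(len(speeds)),
                   key=lambda p: finish[p] + size / speeds[p])
        finish[best] += size / speeds[best]
        assignment[best].append(task_id)

    return assignment, max(finish)


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate of a Weibull distribution at time t > 0.

    shape < 1 gives a decreasing failure rate, shape > 1 an increasing one,
    and shape == 1 is the memoryless (exponential) case.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


if __name__ == "__main__":
    assignment, makespan = lpt_schedule([7, 5, 4, 3, 3], [1.0, 1.0])
    print(assignment, makespan)
    print(weibull_hazard(1.0, 0.5, 1.0), weibull_hazard(4.0, 0.5, 1.0))
```

On this small instance LPT is close to, but not exactly, the best possible makespan, which matches the abstract's framing of LPT as achieving good rather than optimal performance.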

Similar resources

A New Proactive Fault Tolerant Approach for Scheduling in Computational Grid

Grid Computing provides non-trivial services to users and aggregates the power of widely distributed resources. Computational grids solve large scale scientific problems using distributed heterogeneous resources. The Grid Scheduler must select proper resources for executing the tasks with less response time and without missing the deadline. There are various reasons such as network failure, ove...

An Efficient Fault Tolerant Scheduling Approach for Computational Grid

Grid computing serves as an important technology to facilitate distributed computation. Computational grids solve large scale scientific problems using heterogeneous, geographically distributed resources. Problems like dispatching and scheduling of tasks are considered major issues in a computational grid environment. The Grid Scheduler must select proper resources for executing the tasks with l...

A Survey on Fault Tolerance Mechanisms for job scheduling in Grid computing

Grid computing is defined as a hardware and software infrastructure that enables sharing of coordinated resources in a dynamic environment. In grid computing, the probability of a failure is much greater than in parallel computing. Therefore, fault tolerance is an important issue in order to achieve reliability and availability of resources. When scheduling a job, the resource uses both average f...

Real-time Fault-tolerant Scheduling Algorithm for Distributed Computing Systems

This article proposes a Distributed Real-time Fault-tolerant model, a priority Real-time Fault-tolerant algorithm, and a computational architecture of Distributed Real-time Fault-tolerant. According to this model, the problem of how to schedule a weighted Directed Acyclic Graph (DAG) in a distributed computing system for high reliability can be solved in the presence of multiprocessor faults. When som...

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid Computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing that periodically ...

Journal title:
  • J. Parallel Distrib. Comput.

Volume 74

Publication date: 2014